We start by importing the data for NYC jobs from Github and cleaning up the column names. Then we remove duplicate listings.
raw_nyc_df <- read.csv('https://raw.githubusercontent.com/mehtablocker/cuny_607/master/project_3/nyc-jobs.csv')
nyc_jobs_df <- raw_nyc_df
names(nyc_jobs_df) <- names(nyc_jobs_df) %>% tolower() %>% gsub("\\.", "_", .)
names(nyc_jobs_df)[names(nyc_jobs_df)=="x__of_positions"] <- "n_of_positions"
nyc_jobs_df <- nyc_jobs_df %>% select(-posting_type) %>% unique()
nyc_jobs_df %>% tail() %>% datatable()We filter for data science jobs by using a regular expression to search the business_title column for the case insensitive terms “data” or “analytics.” Then we create another table for non-data jobs.
data_jobs_df <- nyc_jobs_df %>% filter(grepl("data|analytics", business_title, ignore.case = T))## Warning: package 'bindrcpp' was built under R version 3.4.4
other_jobs_df <- nyc_jobs_df %>% filter(!grepl("data|analytics", business_title, ignore.case = T))
data_jobs_df %>% head() %>% datatable()We can see from the above table that a lot of key values are missing, including Job Description and Preferred Skills. This significantly limits our analysis capabilities to only a few areas.
Of all the jobs working for New York City, how many are data jobs?
### Total number of jobs in the dataset:
nrow(nyc_jobs_df)## [1] 2205
### Number of data jobs:
nrow(data_jobs_df)## [1] 71
### Data jobs, as a percentage of total:
nrow(data_jobs_df)/nrow(nyc_jobs_df)## [1] 0.03219955
In this dataset, only about 3.2 percent of jobs are data jobs.
How well do data jobs pay relative to non-data jobs?
This data set provided two salaries per job posting. The first was for the minimum, while the second was for the maximum. At first attempt to see if there was a difference between data jobs and non-data jobs, we created box plots to see the difference in distribution.
### Data jobs
summary(data_jobs_df$salary_range_to)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15 65339 84301 78740 99000 161497
### Non-data jobs
summary(other_jobs_df$salary_range_to)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.36 56703.25 75595.50 78639.47 101673.00 230000.00
par(mfrow=c(1,2))
boxplot(data_jobs_df$salary_range_to, xlab="Data Jobs", ylab="Salary in Dollars", ylim=c(0, 200000))
boxplot(other_jobs_df$salary_range_to, xlab="Non-Data Jobs", ylab="Salary in Dollars", ylim=c(0, 200000))par(mfrow=c(1,1))The distribution is wider for non-data jobs, but the median salary is higher for data jobs. It is important to remember that these are all government jobs, which overall may pay less than private sector jobs.
After analyzing this data further, we can see that minimum salary for both data sets is less than $20. These must be hourly rates, and this will unfairly skew the distribution. Therefore, we removed any salary that was less than $20,000.
salary.thresh <- 20000
data_jobs_df_new <- subset(data_jobs_df, salary_range_to >= salary.thresh & salary_range_from >= salary.thresh)
other_jobs_df_new <- subset(other_jobs_df, salary_range_to >= salary.thresh & salary_range_from >= salary.thresh)Now that we have only included annual salaries, we can conduct a statistical analysis on this data. We can run a two-sample \(t\)-test to see if there is any statistical difference between the means. In particular, we wanted to see if data jobs paid better than non-data jobs. The null and alternative hypothesis can be seen below:
\(H_{0}: \mu_{A}=\mu_{B}\)
\(H_{a}: \mu_{A}>\mu_{B}\)
where \(A\) in this case represents the data group, and \(B\) represents the non-data group.
The t-statstic can be calculated using the Welch t-statistic method since the variances do not appear to be equal. The equation is as follows:
\(t = \frac{\bar{x}_{A}-\bar{x}_{B}}{\sqrt{\frac{s_{A}^{2}}{n_{A}}+\frac{s_{B}^{2}}{n_{B}}}}\)
The degrees of freedom, \(df\), are calculated using the following equation:
\(df = \frac{\left(\frac{s_{A}^{2}}{n_{A}}+\frac{s_{B}^{2}}{n_{B}}\right)}{\frac{s_{A}^{4}}{n_{A}^{2}(n_{B}-1)}+\frac{s_{B}^{4}}{n_{B}^{2}(n_{A}-1)}}\)
Below is the \(t\)-test for the higher range of salries. The \(t\)-test shows that we failed to reject the null hypothesis at 95% confidence level, since the \(p\)-value is greater than the significance level.
t.test(data_jobs_df_new$salary_range_to, other_jobs_df_new$salary_range_to, alternative="greater", var.equal=FALSE)
Welch Two Sample t-test
data: data_jobs_df_new$salary_range_to and other_jobs_df_new$salary_range_to
t = -0.077608, df = 71.146, p-value = 0.5308
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
-5588.974 Inf
sample estimates:
mean of x mean of y
88735.88 88984.57
However, if we conduct the same \(t\)-test on the lower salary range, we can see that the \(p\)-value is less than 0.05, as shown below. Therefore, there is sufficient evidence to reject the null hypothesis in favor of the alternative, and we can say that the lower end of the starting salary for data jobs is higher than non-data jobs. The lower range may hold more water since a lot of job postings will try to attract you with the range, but lean towards the lower salary number.
t.test(data_jobs_df$salary_range_from, other_jobs_df$salary_range_from, alternative="greater", var.equal=FALSE)
Welch Two Sample t-test
data: data_jobs_df$salary_range_from and other_jobs_df$salary_range_from
t = 1.7365, df = 75.181, p-value = 0.04328
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
227.0869 Inf
sample estimates:
mean of x mean of y
59624.94 54081.72
Below, we can see the histograms for the lower and higher range salaries of both the data and non-data jobs. These graphs show that the lower salary-range for the data jobs tend to have a greater percentage of their jobs in the the 55-65K range compared to the non-data jobs. However, for the higher end of salary, the difference in distributions is harder to tell. This is consistent with our hypothesis tests.
numbreaks <- 10
up.lim <- 200000
par(mfrow=c(2,2))
hist(data_jobs_df_new$salary_range_from,breaks=numbreaks,xlim=c(0,up.lim),main="Lower Limit Data Jobs",xlab="Salary ($)")
hist(other_jobs_df_new$salary_range_from,breaks=numbreaks,xlim=c(0,up.lim),main="Lower Limit Other Jobs",xlab="Salary ($)")
hist(data_jobs_df_new$salary_range_to,breaks=numbreaks,xlim=c(0,up.lim),main="Upper Limit Data Jobs",xlab="Salary ($)")
hist(other_jobs_df_new$salary_range_to,breaks=numbreaks,xlim=c(0,up.lim),main="Upper Limit Other Jobs",xlab="Salary ($)")Next we import the data for technology jobs within New York City that were posted to dice.com. We separate one of the columns and rename a few others.
raw_dice_df <- read.csv('https://raw.githubusercontent.com/mehtablocker/cuny_607/master/project_3/dice_com_nyc_jobs.csv', stringsAsFactors = F)
dice_jobs_df <- as_tibble(raw_dice_df) %>%
separate(employmenttype_jobstatus, into=c("employment_type", "job_status"), sep = ", ", fill="right", extra = "drop")
dice_jobs_df <- dice_jobs_df %>%
rename(advertiser_url = advertiserurl,
job_description = jobdescription,
job_id = jobid,
job_location = joblocation_address,
job_title = jobtitle,
post_date = postdate)
dice_jobs_df %>% head() %>% datatable()Since this dataset is comprised of only technology jobs, finding specifically data science jobs may require a bit more nuance. For example, if we try to filter for the words “data” or “analytics” as before, we catch a lot of software developer jobs that are not exactly the same subspace as data science.
ds_dice_df <- dice_jobs_df %>% filter(grepl("data|analytics", job_title, ignore.case = T))
ds_dice_df %>% select(job_title, company, employment_type, skills) %>% head() %>% datatable()We can refine our search by excluding words like “engineer” and “architect” to get a more relevant result.
ds_dice_df <- ds_dice_df %>%
filter(!grepl("architect|architecture|engineer|developer|development|administrator|administration", job_title, ignore.case = T))
ds_dice_df %>% select(job_title, company, employment_type, skills) %>% head() %>% datatable()We can text mine the job_description and skills columns to find specific keywords.
How many job postings mention the R programming language?
r_dice_df <- ds_dice_df %>%
filter(grepl(" R | R,", job_description, ignore.case=T) | grepl(" R | R,", skills, ignore.case=T))
nrow(r_dice_df)## [1] 6
r_dice_df %>% head() %>% datatable()Of our 59 filtered job listings, six explicitly mention R.
How many job postings mention Python?
python_dice_df <- ds_dice_df %>%
filter(grepl(" python | python,", job_description, ignore.case=T) | grepl(" python | python,", skills, ignore.case=T))
nrow(python_dice_df)## [1] 8
python_dice_df %>% head() %>% datatable()Of our 59 filtered job listings, eight explicitly mention Python.
To obtain the latest available information about data science job postings, 50 pages of the latest Indeed job postings are scraped using various html nodes in the rvest package. The resulting dataset contains the following fields: job title, company, location, job summary, and link. To identify the correct html nodes that return these fields, the chrome extention SelectorGadget was used in conjunction with inspect element. Regular expressions are also used for each field to remove blank spaces, new lines, and unnecessary information.
listings <- data.frame(title=character(),
company=character(),
location=character(),
summary=character(),
link=character(),
description = character(),
stringsAsFactors=FALSE)
for (i in seq(0, 990, 10)){
url_ds <- paste0('https://www.indeed.com/jobs?q=data+scientist&l=all&start=',i)
var <- read_html(url_ds)
#job title
title <- var %>%
html_nodes('#resultsCol .jobtitle') %>%
html_text() %>%
str_extract("(\\w+.+)+")
#company
company <- var %>%
html_nodes('#resultsCol .company') %>%
html_text() %>%
str_extract("(\\w+).+")
#location
location <- var %>%
html_nodes('#resultsCol .location') %>%
html_text() %>%
str_extract("(\\w+.)+,.[A-Z]{2}")
#summary
summary <- var %>%
html_nodes('#resultsCol .summary') %>%
html_text() %>%
str_extract(".+")
#link
link <- var %>%
html_nodes('#resultsCol .jobtitle .turnstileLink, #resultsCol a.jobtitle') %>%
html_attr('href')
link <- paste0("https://www.indeed.com",link)
listings <- rbind(listings, as.data.frame(cbind(title,
company,
location,
summary,
link)))
}We have successfully scraped job title, company, location, job summary, and job link from 100 pages of Indeed job postings.
datatable(listings)We create a unique ID for each listing based on title, location, company, summary, and description. This unique ID will be used to remove duplicates (since many sponsored posts show up multiple times in Indeed’s postings).
#create a unique id for each job posting attribute combination
listings$uniqueid <- mapply(function(x, y, z) digest(paste0(x,y,z)), listings$title, listings$location, listings$company)
#remove duplicate unique ids
listings %<>%
distinct(uniqueid, .keep_all = TRUE)
#remove duplicate links
listings %<>%
distinct(link, .keep_all = TRUE)
datatable(listings)The summaries provided in the main job listing pages are somewhat limited. Links to each individual job posting were also extracted in the previous step, so it is possible to iterate theough each link and scrape the full job description. To do this, duplicates are removed and rvest is used again to extract full descriptions.
#obtain full description for all job postings
for (i in (1:length(listings$link))){
desciption <- tryCatch(
html_text(html_node(read_html(as.character(listings$link[i])),'.jobsearch-JobComponent-description')),
error=function(e){NA}
)
if (is.null(desciption)){
desc <- NA
}
listings$description[i] <- desciption
}datatable(listings)## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html
Now that full job descriptions are available, regular expressions are used for some minor cleanup. Finally, we store the information from the web scraping in an SQL server. Rather than deleting the SQL table everytime this code chunk is run, we append the new listing to the existing SQL server. This allows any additional job listings to be captured without deleting older job listings. To prevent capturing duplicates in the SQL database, we use uniqueid as the primary key and only append listings with distinct unique IDs.
config <- config::get()
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, user=config$user, password=config$password, dbname=config$dbname, host= config$host)
#Query to create new SQL table using uniqueid as the primary key
sqltable <- "CREATE TABLE listings (
title TEXT,
company TEXT,
location TEXT,
summary TEXT,
link TEXT,
description TEXT,
uniqueid TEXT PRIMARY KEY
)"
#Creates new SQL table "listings" using the query above if it does not exist yet
if (!dbExistsTable(con, "listings")){
dbExecute(con, sqltable)
}
#Instead of dropping the table and creating a new one, we append records. Setting uniqueID as the primary key prevents us from appending duplicate job postings to the SQL database. This requires us to append a subset of job listings that are not duplicates. To do this, we import all uniqueid values from SQL and subset listings to exclude duplicates - then import into SQL.
dbWriteTable(con, "dup_listings", listings, row.names = FALSE)## [1] TRUE
dupcheck <- "
SELECT
uniqueid,
MAX(title) as title,
MAX(company) as company,
MAX(location) as location,
MAX(summary) as summary,
MAX(link) as link,
MAX(description) as description
FROM dup_listings
WHERE uniqueid NOT IN (SELECT uniqueid FROM listings)
GROUP BY uniqueid;
"
dupcheck_results <- dbGetQuery(con, dupcheck)
if (nrow(dupcheck_results) > 0) {
dbWriteTable(con, "listings", dupcheck_results, append = TRUE, row.names = FALSE)
}## [1] TRUE
dbRemoveTable(con, "dup_listings")## [1] TRUE
dbDisconnect(con)## [1] TRUE
This step imports all the unique data science job listings that we have saved in a SQL table.
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, user=config$user, password=config$password, dbname=config$dbname, host= config$host)
#Query to get all job listings from SQL database.
sql <- "
SELECT uniqueid, title, company, location, summary, link, description
FROM listings
"
#Show query results
results <- dbGetQuery(con, sql)
dbDisconnect(con)## [1] TRUE
datatable(results)To see which keywords show up most frequenctly in job descriptions, we create a word cloud. From the word cloud, we see that some important skills for a data science are business, machine learning, teamwork, statistics, analytics, research, and modeling. Languages highlighted in the word cloud include Python, SQL, and Java. This word cloud only provides a high-level summary of skills - for a more in-depth analysis, we will search for specific keywords. Note - R is filtered out of this word cloud.
res <- rquery.wordcloud(paste(results$description), type="text",
lang="english", excludeWords = c("data", "experience","will","work"),
textStemming = FALSE, colorPalette="Paired",
max.words=500)Below we search for occurance rates of specific languages. The majority of job postings mention R, Python, or SQL. Java and Scala are less commonly listed, but also seem to be useful skills for many job listings.
#R
r_listings_df <- results %>%
filter(grepl(" R | R,", description, ignore.case=T) | grepl(" R | R,", summary, ignore.case=T))
#Python
python_listings_df <- results %>%
filter(grepl(" python | python,", description, ignore.case=T) | grepl(" python | python,", summary, ignore.case=T))
#Java
java_listings_df <- results %>%
filter(grepl(" java | java,", description, ignore.case=T) | grepl(" java | java,", summary, ignore.case=T))
#SQL
SQL_listings_df <- results %>%
filter(grepl(" SQL | SQL,", description, ignore.case=T) | grepl(" SQL | SQL,", summary, ignore.case=T))
#Scala
scala_listings_df <- results %>%
filter(grepl(" Scala | Scala,", description, ignore.case=T) | grepl(" Scala | Scala,", summary, ignore.case=T))
language_pct <- tibble(
language = c("R", "Python", "Java", "SQL", "Scala"),
mention_pct = c(round(nrow(r_listings_df)/nrow(results),digits=2),
round(nrow(python_listings_df)/nrow(results),digits=2),
round(nrow(java_listings_df)/nrow(results),digits=2),
round(nrow(SQL_listings_df)/nrow(results),digits=2),
round(nrow(scala_listings_df)/nrow(results),digits=2)))
language_pct %>% ggplot(aes(x=language, y=mention_pct)) +
geom_bar(stat="identity", position=position_dodge(), fill="steelblue") +
labs(x = "Language", y = "Percent Mentioned")We also looked into the percent of job postings that mentioned tools such as Tableau, Spark, Hadoop, and SAS. These results show the prevalence of Big Data - given how often Spark and Hadoop are mentioned.
#Tableau
tableau_listings_df <- results %>%
filter(grepl(" Tableau | Tableau,", description, ignore.case=T) | grepl(" Tableau | Tableau,", summary, ignore.case=T))
#Spark
spark_listings_df <- results %>%
filter(grepl(" spark | spark,", description, ignore.case=T) | grepl(" spark | spark,", summary, ignore.case=T))
#Hadoop
hadoop_listings_df <- results %>%
filter(grepl(" hadoop | hadoop,", description, ignore.case=T) | grepl(" hadoop | hadoop,", summary, ignore.case=T))
#SAS
SAS_listings_df <- results %>%
filter(grepl(" sas | sas,", description, ignore.case=T) | grepl(" sas | sas,", summary, ignore.case=T))
tools_pct <- tibble(
tool = c("Tableau", "Spark", "Hadoop", "SAS"),
mention_pct = c(round(nrow(tableau_listings_df)/nrow(results),digits=2),
round(nrow(spark_listings_df)/nrow(results),digits=2),
round(nrow(hadoop_listings_df)/nrow(results),digits=2),
round(nrow(SAS_listings_df)/nrow(results),digits=2)))
tools_pct %>% ggplot(aes(x=tool, y=mention_pct)) +
geom_bar(stat="identity", position=position_dodge(), fill="maroon") +
labs(x = "Tool", y = "Percent Mentioned")We thought it might be interesting to see the frequency of jobs by location. We wanted to see the frequency by both city and state, so we created to new data frames to look at this:
listings.city <- listings[,1:3]
listings.state <- listings.city %>% separate(location, c("City","State"), sep = ",")Now, we can create frequency tables for both:
locations.city <- as.data.frame(table(listings.city$location))
locations.state <- as.data.frame(table(listings.state$State))
head(locations.city[order(locations.city$Freq,decreasing = TRUE),],10) Var1 Freq
6 New York, NY 200
19 San Francisco, CA 74
41 Boston, MA 29
28 Santa Clara Valley, CA 23
8 Washington, DC 22
26 Seattle, WA 21
57 Chicago, IL 15
96 San Jose, CA 12
36 Sunnyvale, CA 11
9 Bellevue, WA 10
head(locations.state[order(locations.state$Freq,decreasing = TRUE),],10) Var1 Freq
27 NY 238
4 CA 191
16 MA 55
39 WA 37
35 TX 31
37 VA 25
7 DC 22
31 PA 22
13 IL 21
21 NC 15
Unexpectedly, we see that New York City does not have the most Data related jobs, but Princeton, NJ does. In addition, San Francisco is ranked 8th on the list, where we would expect it to be higher ranked that.
When we look at the results by state, we see that NJ has by far the most data related jobs, followed by California and New York. It is interesting that NJ has the most data jobs considering its small size compared to other states. However, it makes sense due to its close proximity to New York City.